Statement of the Problem Studied

The  CMU  Image  Understanding  program  performs  basic  research  and  technology  development  toward 
robust,  flexible,  and  precise  vision  systems  to  impact  a  wide  variety  of  military  and  civilian  applications. 
Our  areas  of  focus  include:  model-based  vision,  in  which  objects  are  recognized  from  prior  or  acquired 
solid  models;  3D  shape  inference,  in  which  image  physics  are  used  to  infer  3D  depth  from  one  or  more 
images;  and  vision  applications  including  mobile  robots,  robot  sensor  calibration,  and  human-computer 
interaction. 

Most current methods for computer vision still depend, for their low-level analysis, on traditional
signal-processing methods such as edge detection and pixel clustering. In contrast, our research on physical
models  for  computer  vision  addresses  modeling  physical  processes,  such  as  laws  of  reflection,  image 
formation,  and  object  and  sensor  properties.  These  explicit  models  can  cope  with  highlights,  shadows, 
surface  texture,  and  other  phenomena  that  cause  complex  variations  in  intensity  and  color. 

Current  vision  algorithms  are  designed  as  static  systems;  they  use  preprogrammed  structures  and 
parameters  even  after  recognition  and  processing  failures  due  to  environmental  variations  and 
discrepancies  between  models  and  reality.  Our  research  aims  to  develop  learning  techniques  that  can 
overcome  such  discrepancies  and  adapt  to  new  environments.  These  learning  algorithms  are  developed 
and  tested  in  task-oriented  vision  problems  rather  than  on  a  traditional  abstract  machine-learning  problem. 
Such  task-oriented  problems  include:  learning  appearance  models  for  virtual  reality  systems,  learning  the 
concept of human faces from examples, learning landmark models for outdoor navigation, learning the
SAR  recognition  program  from  examples,  and  learning  what  to  do  by  observing  human  actions. 
Summary of the Most Important Results


Hypergeometric filter-based image matching

A hypergeometric filter-based approach has been developed for general image matching problems. It is
applicable to a wide range of matching problems such as focus, stereo, optical flow, and affine
matching, and it achieves much higher precision than traditional approaches because
window and foreshortening effects are eliminated. In this approach, the effects of the finite window size
can  be  expressed  as  high  order  terms  in  a  Taylor  expansion.  Ignoring  those  window  effects  in  traditional 
approaches  is  equivalent  to  truncating  the  Taylor  expansion  after  the  first  term.  Therefore,  by  truncating 
the  expansion  at  a  higher  order,  we  are  able  to  reduce,  and  numerically  eliminate,  window  effects. 
Furthermore,  the  shift  variance  effects  caused  by  non-zero  gradients  of  matching  parameters  such  as 
foreshortening  in  stereo  and  affine  deformation  in  optical  flow,  can  be  represented  analytically  as  a  linear 
combination  of  filter  outputs.  The  hypergeometric  filter  achieves  very  high  precision  by  taking  both 
window  and  shift  variance  effects  into  consideration.  Using  the  hypergeometric  filter  approach,  we 
experimented  with  image  matching  problems  such  as  depth  from  defocus  with  and  without  slope 
estimation,  depth  from  stereo  with  and  without  foreshortening  estimation,  optical  flow,  and  affine 
matching. In all those experiments, our new approach produced higher-precision results than
state-of-the-art techniques designed specifically for the individual problems.


Multi-body Factorization for Structure from Motion

Structure from motion (SFM) has been one of the most active research areas in computer vision during
the past two decades. Most SFM methods, however, neglect the multi-body problem.
Rather,  they  are  based  on  the  assumption  that  only  a  single  motion  is  included  in  the  image  sequence; 
either  the  environment  is  static  and  the  observer  moves,  or  the  observer  is  static  and  only  one  object  in  the 
scene  is  in  motion.  More  difficult  and  less  studied  is  the  general  case  of  an  unknown  number  of  objects 
moving  independently.  At  CMU  we  have  developed  the  factorization  method  for  robust  structure  from 
motion:  the  initial  orthographic  factorization,  the  para-perspective  factorization,  and  the  sequential 
factorization. Yet all of these methods, as well as other previous methods, can deal with only a
single-body problem.

We  have  developed  a  new  method  for  separating  and  recovering  the  motion  and  shape  of  multiple, 
independently  moving  objects  in  a  sequence  of  images.  This  new  method  does  not  require  any  grouping 
of  features  into  an  object  at  the  image  level;  nor  does  it  require  prior  knowledge  of  the  number  of  objects. 
The  key  idea  was  the  introduction  of  a  mathematical  construct  of  object  shapes,  called  the  shape 
interaction  matrix,  which  is  invariant  to  both  the  object  motions  and  the  selection  of  coordinate  systems. 
This  invariant  structure  is  computable  solely  from  the  observed  trajectories  of  image  features  without 
grouping  them  into  individual  objects.  Once  the  matrix  is  computed,  it  allows  for  segmenting  features  into 
objects,  as  well  as  for  recovering  the  shape  and  motion  of  each  object  by  the  process  of  transforming  it 
into  a  canonical  form.  The  method  has  been  tested  successfully  with  simulated  data  and  simple  real 
image sequences. This method remains the only non-heuristic method for the multi-body structure
from motion problem.
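
The role of the shape interaction matrix can be illustrated with a small synthetic experiment. The sketch below is not the report's implementation: it assumes noise-free orthographic projection, takes the matrix to be Q = V V^T built from the leading right singular vectors of the trajectory matrix W, and estimates the rank with a simple relative threshold. The helper names (random_rotation, track_object, shape_interaction_matrix) are hypothetical. Features are ordered by object here only so that the block structure is visible; with unsorted features, segmentation amounts to finding the permutation that makes Q block-diagonal.

```python
import numpy as np

def random_rotation(rng):
    """Draw a random 3x3 rotation matrix via QR decomposition."""
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    if np.linalg.det(q) < 0:          # flip one axis so that det = +1
        q[:, 0] *= -1
    return q

def track_object(shape, n_frames, rng):
    """Orthographic image trajectories (2F x N) of one rigid object."""
    rows = []
    for _ in range(n_frames):
        rot = random_rotation(rng)
        trans = rng.standard_normal(2)
        rows.append(rot[:2] @ shape + trans[:, None])
    return np.vstack(rows)

def shape_interaction_matrix(W, rel_tol=1e-8):
    """Q = V_r V_r^T from the top-r right singular vectors of W.

    The rank r is estimated from the singular values (an assumption that
    only holds for noise-free data such as this toy example).
    """
    _, s, vt = np.linalg.svd(W, full_matrices=False)
    rank = int(np.sum(s > rel_tol * s[0]))
    v = vt[:rank].T
    return v @ v.T, rank

rng = np.random.default_rng(0)
n_a, n_b, n_frames = 6, 5, 10
W = np.hstack([track_object(rng.standard_normal((3, n_a)), n_frames, rng),
               track_object(rng.standard_normal((3, n_b)), n_frames, rng)])
Q, rank = shape_interaction_matrix(W)

# Entries of Q that pair a feature of object A with a feature of object B
cross_block = np.abs(Q[:n_a, n_a:]).max()
print(rank, cross_block)
```

In this setting each object contributes rank 4 to W, and the entries of Q linking features on different objects vanish up to rounding error, which is what makes grouping possible without image-level segmentation.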


Trainable  face  detection 


We  developed  a  trainable  face  detection  system  that  can  locate  all  upright,  frontal  faces  in  a  scene.  The 
faces  can  be  of  any  size  and  can  appear  against  arbitrary  backgrounds.  The  system  uses  color  and  motion 
cues to restrict its search, and can process a 320x240 pixel image in less than a second on an SGI Indy
workstation, and later at more than 5 frames per second on Pentium II PCs. The system has been used in
many application systems within and outside CMU, including image retrieval, a news-on-demand
system for quick access to video information, and human-computer interaction systems.


3D  Surface  Representation  from  Multiple  Range  Images 

For  acquiring  surface  representation,  we  developed  a  system  that  creates  3D  surface  representations  from 
range  images  of  the  object.  The  method  consists  of  acquiring  several  range-image  views  of  the  object, 
aligning the image data, merging the image data with the aid of a volumetric representation, and then
extracting  a  triangle  mesh  from  the  volumetric  representation  of  the  merged  data.  Our  main  contribution 
is  a  new  algorithm,  the  consensus-surface  algorithm,  which  eliminates  many  of  the  troublesome  effects  of 
noise  and  extraneous  surface  observations  in  the  data.  It  does  so  by  searching  for  a  consensus  of  surface 
observations  in  order  to  estimate  the  implicit  distance  from  each  point  in  the  volume  to  the  closest  point 
on  the  surface.  This  algorithm  can  produce  accurate  object  models  despite  the  poor  quality  of  data 
available  from  real  imagery  (for  both  range  and  intensity  images). 

Learning  of  Object  Appearance  Model 

Generating  realistic  images  of  a  three  dimensional  object  for  virtual  reality  systems  requires  two  pieces  of 
information:  the  object’s  shape  (geometric  information)  and  reflectance  properties  (photometric 
information)  such  as  color  and  specularity.  While  significant  progress  has  been  made  in  computer 
graphics  hardware  and  image  rendering  algorithms,  object  models  are  still  created  manually  -  a 
bottleneck  for  realistic  image  synthesis. 

We  have  developed  a  novel  approach  to  learn  photometric  information  as  well  as  geometric  information 
of  an  object  by  simply  observing  a  real  object.  This  method  not  only  skips  the  time-consuming  manual 
modeling,  but  also  provides  a  much  more  realistic  and  accurate  appearance  of  an  object  when  generated 
by  the  virtual  reality  system.  The  method  utilizes  a  series  of  color  images  of  an  object  under  a  moving 
light  source.  Then,  it  observes  the  color  transition  at  each  pixel,  and  records  it  into  the  four  dimensional 
RGB-Time  space,  referred  to  as  the  temporal-color  space.  The  color  transition  curve  in  the  temporal-color 
space  can  be  decomposed  into  diffuse  and  specular  component  curves  using  the  singular  value 
decomposition method. According to the dichromatic theory, those two curves lie on two hyperplanes in the
temporal-color space. By analyzing properties of those two curves, such as their width and height on the
two hyperplanes, the method acquires geometric and photometric information about an object.

Recognition of 3D Objects in Range Images by the Spin Image Method

We  have  developed  a  representation  that  combines  the  descriptiveness  of  global  object  properties  with  the 
robustness  to  partial  views  and  clutter  of  local  shape  descriptions.  A  local  basis  is  computed  at  an  oriented 
point  (3-D  point  with  surface  normal)  on  the  surface  of  an  object.  All  the  positions  on  the  object  surface 
now  can  be  described  with  respect  to  the  basis  of  other  points  by  two  parameters.  By  accumulating  these 
parameters in a 2-D array, a descriptive image (spin-image) associated with the point is created. Because
spin-images describe the coordinates of points on the surface of an object with respect to the local basis,
they are a local encoding of the global shape of the object and are invariant to rigid transformations.
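
The construction of a spin-image can be sketched as follows. The text above does not spell out the two parameters; this sketch assumes the usual choice of the radial distance from the line through the normal (alpha) and the signed height along the normal (beta). The function names, bin count, and ranges are illustrative assumptions, and a random point cloud stands in for points sampled from a real surface.

```python
import numpy as np

def spin_coordinates(points, p, n):
    """Coordinates (alpha, beta) of points relative to the oriented point (p, n)."""
    n = n / np.linalg.norm(n)
    d = points - p
    beta = d @ n                                    # signed height along the normal
    # radial distance from the normal line (clamped against rounding error)
    alpha = np.sqrt(np.maximum((d * d).sum(axis=1) - beta ** 2, 0.0))
    return alpha, beta

def spin_image(points, p, n, bins=16, alpha_max=4.0, beta_max=4.0):
    """Accumulate (alpha, beta) pairs into a 2-D array."""
    alpha, beta = spin_coordinates(points, p, n)
    hist, _, _ = np.histogram2d(alpha, beta, bins=bins,
                                range=[[0.0, alpha_max], [-beta_max, beta_max]])
    return hist

rng = np.random.default_rng(1)
cloud = rng.uniform(-1.0, 1.0, size=(200, 3))       # stand-in for surface points
p, n = cloud[0], np.array([0.0, 0.0, 1.0])          # assumed oriented point
image = spin_image(cloud, p, n)

# Apply a rigid transformation to the cloud and to the oriented point.
c, s = np.cos(0.7), np.sin(0.7)
R = np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])
t = np.array([0.5, -2.0, 1.0])
moved = cloud @ R.T + t
alpha0, beta0 = spin_coordinates(cloud, p, n)
alpha1, beta1 = spin_coordinates(moved, R @ p + t, R @ n)
print(np.allclose(alpha0, alpha1), np.allclose(beta0, beta1))
```

Because alpha and beta depend only on distances and on the angle to the normal, they are unchanged when the object and the oriented point move together, which is the invariance the paragraph above relies on.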

At  recognition  time,  spin-images  from  points  on  the  model  are  compared  with  spin-images  from  points  in 
the  scene;  when  two  images  are  similar  enough,  a  point  correspondence  between  model  and  scene  is 
established. After point matching, a model is localized in the scene by grouping correspondences to
compute a transformation, which is subsequently refined and verified using a modified iterative closest
point registration algorithm.

This  recognition  algorithm  has  been  integrated  into  a  semi-automatic  world  modeling  system  called 
Artisan.  Artisan  combines  3-D  sensors,  object  modeling  and  analysis  software,  and  an  operator  interface 
to create a 3-D model of a robot’s work area. Through object recognition, Artisan assigns semantic
meaning to objects in the scene, which facilitates execution of robotic commands and drastically
simplifies operator interaction. Artisan was demonstrated in several tasks at the Oak Ridge National
Laboratory, using a remotely operated mobile platform.
Object Recognition in SAR Images


Automatic target recognition (ATR) using synthetic aperture radar (SAR) images is an important military
application  area.  SAR  sensors  allow  continuous  day/night  coverage  under  all  weather  conditions,  and  can 
achieve  high  spatial  resolution  even  from  orbital  platforms. 

We developed a trainable SAR ATR system based on a new technique called “eigenwindows”. This system
divides  each  training  image  into  small  subwindows,  all  of  which  are  stored  as  points  in  the  eigen  space. 
An  unknown  target  image  is  also  broken  into  subwindows  and  projected  to  the  eigen  space.  Each  pairing 
of  a  target  eigenwindow  point  and  a  training  point  votes  for  a  particular  target  and  viewing  angle,  and  the 
final  classification  is  achieved  as  the  consensus  of  all  such  votes.  This  eigenwindow  approach  has  a 
number  of  benefits.  First,  when  some  parts  of  a  target  are  occluded,  remaining  windows  covering  visible 
parts  can  identify  the  target.  Second,  to  detect  a  target  with  articulated  components,  we  can  define 
separate  windows  for  each,  and  recognition  can  proceed  separately  on  the  articulated  parts  and  the  body. 
Third,  the  method  is  by  definition  insensitive  to  image  translation.  Finally,  using  multiple  small  windows 
rather  than  a  whole  image  greatly  reduces  the  dimensionality  of  the  eigen  spaces  that  must  be 
manipulated. 
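
The voting scheme described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the evaluated system: the eigen space is taken to be a PCA basis computed by SVD over all training subwindows, each test subwindow votes for the label of its single nearest training point, and the class name EigenwindowClassifier, its parameters, and the random arrays standing in for SAR image chips are all hypothetical.

```python
from collections import Counter
import numpy as np

def subwindows(image, size, step):
    """Stack all size x size windows of an image, flattened, one per row."""
    h, w = image.shape
    return np.array([image[r:r + size, c:c + size].ravel()
                     for r in range(0, h - size + 1, step)
                     for c in range(0, w - size + 1, step)])

class EigenwindowClassifier:
    def __init__(self, size=8, step=4, n_components=10):
        self.size, self.step, self.n_components = size, step, n_components

    def fit(self, images, labels):
        windows, self.window_labels = [], []
        for image, label in zip(images, labels):
            w = subwindows(image, self.size, self.step)
            windows.append(w)
            self.window_labels.extend([label] * len(w))
        X = np.vstack(windows)
        self.mean = X.mean(axis=0)
        # Principal directions of the training subwindows span the eigen space.
        _, _, vt = np.linalg.svd(X - self.mean, full_matrices=False)
        self.basis = vt[:self.n_components].T
        self.points = (X - self.mean) @ self.basis
        return self

    def predict(self, image):
        q = (subwindows(image, self.size, self.step) - self.mean) @ self.basis
        dist = ((q[:, None, :] - self.points[None, :, :]) ** 2).sum(axis=-1)
        # Each test window votes for the (target, azimuth) of its nearest training point.
        votes = Counter(self.window_labels[i] for i in dist.argmin(axis=1))
        return votes.most_common(1)[0][0]

rng = np.random.default_rng(2)
labels = [("BMP", 0), ("BMP", 90), ("M60", 0), ("M60", 90)]   # (target, azimuth)
train = [rng.random((32, 32)) for _ in labels]                 # synthetic stand-ins
clf = EigenwindowClassifier().fit(train, labels)

occluded = train[2].copy()
occluded[:, :8] = 0.0            # hide the left quarter of the image
prediction = clf.predict(occluded)
print(prediction)
```

Because votes are pooled without regard to where a window sits in the image, the windows that still cover visible parts of the target dominate the tally even when part of the image is occluded.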

The eigenwindow-based SAR ATR system was evaluated using seven target types: BMP, BTR60,
KTANK, M35, M113, M60 and SCUD. Training images for each target were generated via the XPATCH
simulator  by  varying  the  azimuth  angle  from  0  to  359  degrees  in  1  degree  increments,  while  maintaining  a 
constant  SAR  depression  angle  of  22.5  degrees  and  resolution  of  30  cm/pixel.  Test  images  were  also 
generated  via  XPATCH,  at  fractional  azimuth  values.  A  target  classification  produced  by  the  system  was 
considered  to  be  correct  if  it  was  of  the  correct  object  type,  and  had  an  estimated  azimuth  angle  within  5 
degrees of the correct angle. Under this criterion, when the system was tasked to produce a single, best
candidate hypothesis, the mean classification accuracy was 95% (std 4%) for unoccluded targets, and
93% (std 5%) for targets occluded up to 50% in the worst case.
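
The correctness criterion can be written down directly. The sketch below assumes that the azimuth difference is measured circularly, so that 359.5 and 2.5 degrees are 3 degrees apart, and that "within 5 degrees" is inclusive; neither detail is stated in the text, and the function names are hypothetical.

```python
def azimuth_error(estimated_deg, true_deg):
    """Smallest absolute angular difference in degrees, with wrap-around at 360."""
    diff = abs(estimated_deg - true_deg) % 360.0
    return min(diff, 360.0 - diff)

def is_correct(pred_type, pred_azimuth, true_type, true_azimuth, tol_deg=5.0):
    """A hypothesis counts as correct if the type matches and the azimuth is within tolerance."""
    return pred_type == true_type and azimuth_error(pred_azimuth, true_azimuth) <= tol_deg

print(is_correct("M60", 359.5, "M60", 2.5))
```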


Shape  Matching  Technique  and  its  Medical  Application 

A  shape  matching  (or  registration)  method  based  on  the  iterative  closest  point  algorithm  has  been 
developed  and  applied  to  computer-assisted  surgical  systems.  The  registration  process  is  a  fundamental 
component  of  most  computer-assisted  surgical  systems.  Registration  estimates  a  spatial  transformation 
between  two  coordinate  systems:  a  pre-operative  system  used  to  construct  plans  or  simulations  based 
upon  medical  data  (e.g.,  CT,  MRI,  or  X-ray  images),  and  an  intra-operative  system  in  which  the  surgical 
procedure  is  performed  (e.g.,  relative  to  a  robot,  navigational  guidance  system,  etc.). 
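
For readers unfamiliar with the iterative closest point algorithm, a minimal sketch follows. It is the basic point-to-point form with brute-force nearest-neighbor matching and a closed-form SVD solution for the rigid transformation at each step, not the method developed in this work; the function names and the synthetic point sets are assumptions for illustration only.

```python
import numpy as np

def best_rigid_transform(src, dst):
    """Least-squares R, t such that dst is approximately src @ R.T + t."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)             # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection so that det(R) = +1.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    return R, dst_c - R @ src_c

def icp(src, dst, n_iter=20):
    """Basic point-to-point ICP; returns the accumulated R, t and the final RMS residual."""
    R_total, t_total = np.eye(3), np.zeros(3)
    current = src.copy()
    for _ in range(n_iter):
        # Match every current point to its closest point in dst.
        d2 = ((current[:, None, :] - dst[None, :, :]) ** 2).sum(axis=-1)
        matched = dst[d2.argmin(axis=1)]
        R, t = best_rigid_transform(current, matched)
        current = current @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t
    # Residual against the correspondences used in the last iteration.
    rms = np.sqrt(((current - matched) ** 2).sum(axis=1).mean())
    return R_total, t_total, rms

rng = np.random.default_rng(3)
model = rng.uniform(-1.0, 1.0, size=(100, 3))       # stand-in for pre-operative data
angle = np.deg2rad(3.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.02, -0.01, 0.03])
scene = model @ R_true.T + t_true                   # stand-in for intra-operative data
R_est, t_est, rms = icp(model, scene)
print(rms)
```

ICP of this kind only converges from a reasonably close initial alignment, which is why the example uses a small misalignment.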

This  work  addresses  the  problem  of  improving  shape-based  registration  accuracy  via  intelligent  selection 
of registration data and on-line estimation of accuracy. Intelligent data selection (IDS) comprises
geometric constraint analysis, which provides a sensitivity measure shown to be well correlated with
registration accuracy, and geometric constraint synthesis, an optimization process that generates data
configurations maximizing the sensitivity measure for a fixed quantity of data. IDS uses the
pre-operative shape representation to generate a data collection plan (DCP), which can be used during surgery
to  guide  the  acquisition  of  registration  data.  On-line  accuracy  estimation  provides  an  upper  bound  on  true 
registration  accuracy  based  upon  a  conventional  root-mean-squared  error. 

After validation in vitro on cadaveric specimens and via simulation studies, the above method has been
incorporated into the HipNav system, a clinical image-guided orthopedic surgical system, which has been
used for more than 100 actual surgeries.

Handling  Indeterminacy  and  Uncertainty  in  Computer  Vision 

Parameter  indeterminacies  are  inherent  in  3D  computer  vision.  However,  there  has  not  been  a  general 
and convenient method available for representing and analyzing the indeterminacies and their effects on
accuracy. Consequently, their effects have so far usually been ignored in uncertainty modeling
research. We developed a gauge-based uncertainty representation for 3D estimation that includes
indeterminacies.  We  represent  indeterminacies  with  orbits  in  the  parameter  space  and  model  local 
linearized  parameter  indeterminacies  as  gauge  freedoms.  Combining  this  formalism  with  first  order 
perturbation  theory,  we  are  able  to  model  uncertainties  along  with  parameter  indeterminacies. 

The  key  to  our  work  is  a  geometric  interpretation  of  the  parameters  and  gauge  freedoms.  We  solve  the 
problem  of  how  to  compare  parameter  uncertainties  despite  indeterminacies  and  added  constraints.  This 
permits  us  to  extend  the  Cramer-Rao  lower  bound  to  problems  that  include  parameter  indeterminacies.  In 
3D  computer  vision  the  basic  quantities  that  often  cannot  be  recovered  include  scale,  rotation  and 
translation.  We  use  our  method  to  analyze  the  local  effects  of  these  indeterminacies  on  the  estimated 
shape,  and  find  all  the  local  gauge  freedoms.  This  enables  us  to  express  the  uncertainties  when  additional 
information  is  available  from  measurements  that  constrain  the  gauge  freedoms.  Through  analytical  and 
empirical  means  we  gain  intuition  into  the  effects  of  constraining  the  gauge  freedoms,  for  both  general 
Structure  from  Motion  and  stereo  shape  estimation.  We  include,  in  our  uncertainty  model,  measurement 
errors  and  feature  localization  errors.  These  results  along  with  our  theory  allow  us  to  find  optimal 
constraints  on  the  gauge  freedoms  that  maximize  the  accuracy  of  the  part  of  the  object  we  seek  to 
estimate. 